Biostatistics For Dummies, 2nd Edition (Monika Wahi, John Pezzullo)

PART 5 Looking for Relationships with Correlation and Regression

Executing a Multiple Regression Analysis in Software

Before executing your multiple regression analysis, you may need to do some prep work on the variables you intend to include in your model. In the following sections, we explain how to handle the categorical variables you plan to include. We also show you how to examine these variables by making several charts before you run your analysis. If you need guidance on what variables to consider for your models, read Chapter 20.

Preparing categorical variables

The predictors in a multiple regression model can be either numerical or categorical (Chapter 8 discusses the different types of data). In a categorical variable, each category is called a level. If a variable, like Setting, can have only two levels, like Inpatient or Outpatient, then it’s called a dichotomous or a binary categorical variable. If it can have more than two levels, it is called a multilevel variable.

Figuring out the best way to introduce categorical predictors into a multiple regression model is always challenging. You have to set up your data the right way, or you’ll get results that are either wrong or difficult to interpret properly. Following are two important factors to consider.

Having enough participants in each level of each categorical variable

Before using a categorical variable in a multiple regression model, you should tabulate how many participants (or rows) are included in each level. If you have any sparse levels (those with row frequencies in the single digits), you will want to consider collapsing them into others. Usually, the more evenly the rows are distributed across the levels, and the fewer levels there are, the more precise and reliable the results. If a level doesn’t contain enough rows, the program may ignore that level, halt with a warning message, produce incorrect results, or crash. Worse, if it does produce results, they can be impossible to interpret.
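The tabulate-then-collapse check described above can be sketched in a few lines of Python. This is only an illustration, not the output of any particular statistics package: the data are made-up Primary Diagnosis values, the helper names (tabulate, sparse_levels, collapse) are hypothetical, and the cutoff of 10 rows is our assumption for what "single digits" means.

```python
from collections import Counter

# Hypothetical data: one Primary Diagnosis value per participant (row).
diagnoses = (
    ["Hypertension"] * 73 + ["Diabetes"] * 35 + ["Cancer"] * 1 + ["Other"] * 10
)

# Assumption: a level is "sparse" when it has fewer than 10 rows.
SPARSE_CUTOFF = 10


def tabulate(values):
    """Return a one-way frequency table as {level: row count}."""
    return dict(Counter(values))


def sparse_levels(freq, cutoff=SPARSE_CUTOFF):
    """Return the levels whose row count falls below the cutoff."""
    return sorted(level for level, n in freq.items() if n < cutoff)


def collapse(values, levels_to_merge, target="Other"):
    """Recode every value found in levels_to_merge as the target level."""
    merge = set(levels_to_merge)
    return [target if v in merge else v for v in values]


freq = tabulate(diagnoses)
sparse = sparse_levels(freq)
collapsed = collapse(diagnoses, sparse)

print(freq)                  # frequency table before collapsing
print(sparse)                # levels flagged as sparse
print(tabulate(collapsed))   # frequency table after collapsing
```

Collapsing into a new list, rather than overwriting the original values, keeps the original variable available in case you later decide on a different grouping.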

Imagine that you create a one-way frequency table of a Primary Diagnosis variable from a sample of study participant data. Your results are: Hypertension: 73, Diabetes: 35, Cancer: 1, and Other: 10. To deal with the sparse Cancer level, you may want to create another variable in which Cancer is collapsed together with Other (which would then have 11 rows). Another approach is to create a binary variable with yes/no levels, such as: Hypertension: 73 and No Hypertension: 46. But binary variables don’t take into account the other levels. You could also make